Abstract. Term frequency normalization is a serious issue because document lengths vary
widely. Generally, documents become long for two different reasons: verbosity and
multi-topicality. Verbosity means that the same topic is repeatedly mentioned by terms
related to the topic, so that term frequencies are higher than in a well-summarized
document. Multi-topicality means that a document discusses multiple topics broadly, rather
than a single topic. Although these document characteristics should be handled differently,
all previous methods of term frequency normalization have ignored the difference and have
used a simplified length-driven approach that decreases the term frequency by the length
of the document only, causing an unreasonable penalization. To attack this problem, we
propose a novel TF normalization method based on a partially-axiomatic approach. We first
formulate two formal constraints that a retrieval model should satisfy for documents with
verbose and multi-topical characteristics, respectively. Then, we modify language modeling
approaches to better satisfy these two constraints, and derive novel smoothing methods.
Experimental results show that the proposed method significantly increases the precision
for keyword queries and substantially improves MAP (Mean Average Precision) for verbose
queries.


1 Introduction 

High-performing retrieval models rely on two different factors: TF (term frequency) and
IDF (inverse document frequency). Of these, the TF factor is non-trivial, because long
documents tend to have higher term frequencies than short ones, so that a naive estimate
of term frequency is not reliable. Thus, the term frequency of long documents should be
considered carefully. In this regard, Singhal observed the following two reasons why a
document becomes long [1] (see footnote 3).

1. High term frequency: The same term occurs repeatedly in a long document. As a result,
the term frequency factors may be large for long documents, increasing the average
contribution of its terms towards the query-document similarity.

2. More terms: A long document has a large vocabulary. This increases the number of
matches between a query and a long document, increasing the query-document similarity
and the chances of retrieval of long documents in preference over shorter documents.

Footnote 3: Robertson and Walker referred to the two types of reasons as the scope
hypothesis and the verbosity hypothesis, respectively [2].

Without loss of meaning, we can conceptualize these two reasons as verbosity and
multi-topicality. Verbosity means that the same topic is repeatedly mentioned by terms
related to the topic, making term frequencies high. Multi-topicality means that a document
discusses multiple topics broadly rather than a single topic, producing more distinct
terms. Using these concepts, we divide long documents into two ideal types: verbose
documents and multi-topical documents. A verbose document is one that becomes long mainly
due to verbosity rather than multi-topicality, while a multi-topical document is one that
becomes long mainly due to multi-topicality rather than verbosity.

Singhal assumed that long documents should be penalized regardless of whether they are
verbose or multi-topical [1]. Basically, this approach is a simplified length-driven
method that decreases the term frequency of all long documents according to the document
length only. However, we argue that this assumption does not hold: the penalization should
be applied to verbose documents only, not to multi-topical documents. The main reason is
that terms in a multi-topical document are repeated less than those in a verbose document,
since the length of a multi-topical document grows because of its broad topics. Singhal's
method does not distinguish these two types of documents, which should be handled
differently. Therefore, a retrieval function adopting Singhal's penalization makes
multi-topical documents unreasonably less preferred, causing an unfair retrieval ranking.

To support our argument, we give two examples that illustrate the different tendencies of
term frequencies in verbose and multi-topical documents. First, consider two sample
documents D1 and D2 that have the same term frequency ratios.



D2 is D1 concatenated twice. Suppose that the query "language modeling approach" is given.
Then the question arises: which of D1 and D2 is more relevant? Comparing the contained
information, the two documents have exactly the same content, although D2 is twice as long
as D1. Thus, D1 and D2 should have the same relevance score. However, the absolute term
frequency in D2 is twice that in D1, so the naive TF·IDF prefers D2 to D1. To avoid this
unfair comparison, we should introduce a TF normalization. To this end, let l be the
length of a document and tf the term frequency of a query term. Then one reasonable TF
normalization strategy is to use tfn = tf / l instead of tf, and the modified TF·IDF
produces the same score for D1 and D2. Note that Singhal's pivoted length normalization
also works well here, since tfn is well reflected in Singhal's original formula. Note also
that D2 is a verbose document, not a multi-topical document, which is the main reason for
the success of the normalization. Now, we examine the second situation by considering a
multi-topical sample document D3, which contains all topics of D1 and D2 as a subpart.

    D3: Information retrieval model
        Language modeling approach

Here, D3 describes a broad topic, "information retrieval model", and contains "language
modeling approach" as a subtopic. Again, suppose that the same query "language modeling
approach" is given, and consider what relevance score should be assigned to D3 compared
with D1 and D2. D3 contains all the content of D1 and D2, although D3 is different from D1
and D2. In this case, a user who sees D3 would think that D3 is also relevant, because all
the relevant content (D1) is embedded in D3. From this viewpoint, D3 should have the same
score as D1 and D2 (due to partial relevance). However, if we apply the previous TF
normalization (i.e. tfn = tf / l) to D3, then D3 is much less preferred than D1 and D2,
since the term frequency of each query term is the same as in D1 but the length is twice
that of D1. Singhal's method likewise assigns a lower score to D3 than to D1 and D2. The
main reason for this failure is that D3 is not a verbose document but a multi-topical
document. This result means that the TF normalization problem is more complex, and at
least requires different strategies according to the type of long document. To avoid
unreasonable penalization of multi-topical documents, the TF normalization problem should
be re-investigated more deeply by discriminating multi-topical documents from verbose
documents.
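
The two situations can be checked numerically. The following is a minimal sketch which
assumes, purely for illustration, that D1 is the three-word text "language modeling
approach" and that D2 is D1 written twice; D3 is the six-word sample shown above. The
helper name tfn is hypothetical, and tokenization is plain whitespace splitting.

```python
from collections import Counter

def tfn(term, doc_tokens):
    """Length-normalized term frequency tf / l for one term."""
    return Counter(doc_tokens)[term] / len(doc_tokens)

# D1 and D2 are assumed sample texts (illustrative only); D3 is the multi-topical sample.
d1 = "language modeling approach".split()
d2 = d1 * 2  # D1 concatenated with itself (verbose)
d3 = "information retrieval model language modeling approach".split()

query = "language modeling approach".split()
for name, doc in (("D1", d1), ("D2", d2), ("D3", d3)):
    raw = [Counter(doc)[w] for w in query]
    normalized = [round(tfn(w, doc), 3) for w in query]
    print(name, "tf =", raw, "tf/l =", normalized)
```

Under these assumptions, tf / l is identical for D1 and D2, while every query term in D3
receives half the normalized frequency it has in D1, which is the penalization discussed
above.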

To obtain a more accurate TF normalization, we propose a novel TF normalization method
based on an axiomatic approach. As a case study, we modify the language modeling approach
without losing its elegance and principles. To this end, we first formulate two
constraints that retrieval scoring functions should satisfy for verbose and multi-topical
documents, respectively. Then, we present an analysis showing that previous language
modeling approaches do not sufficiently satisfy these constraints. After that, we modify
the language modeling approaches so that they better satisfy the two constraints, derive
novel smoothing methods, and evaluate them.

2 Formal Constraints of New TF Normalization, and Analysis of 
Previous Language Modeling Approaches 

2.1 Constraints 

From now on, we assume that τ(D) is a measurement of the number of topics in document D.
We define K-verbosity and N-topicality as follows.

Definition (K-verbosity): Suppose that D1 and D2 are given. Let tf_1(w) and tf_2(w) be the
term frequencies of term w in D1 and D2, respectively. If tf_2(w) = K · tf_1(w) for all
terms w and τ(D1) = τ(D2), then D2 has K-verbosity to D1, or D2 is K-verbose to D1.

Definition (N-topicality): Suppose that D1 and D2 are given with τ(D2) = N · τ(D1). Let
l_1 and l_2 be the lengths of D1 and D2, respectively. If tf_2(w)/l_2 = (tf_1(w)/l_1)/N
for all terms w in D1, then D2 has N-topicality to D1, or D2 is N-topical to D1.

In our three samples from the introduction, D2 has 2-verbosity to D1, and D3 has
2-topicality to D1. Recall that we have argued that D1, D2 and D3 should have the same
relevance score. This argument can be reformulated as the following two constraints, VNC
and TNC, which a retrieval function should satisfy when one document has K-verbosity or
N-topicality to another document, respectively. Let score(Q,D) be a similarity function
between a document D and a query Q.

VNC (Verbosity Normalization Constraint): Suppose a pair of documents D1 and D2. If D2 is
K-verbose to D1, then score(Q,D1) = score(Q,D2).

TNC (Topicality Normalization Constraint): Suppose a pair of documents D1 and D2. If D2 is
N-topical to D1, then score(Q,D1) = score(Q,D2).
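
The two definitions can be turned into simple predicates over term-frequency tables. The
sketch below is only an illustration: the function names are hypothetical, the sample
texts for D1 and D2 are assumed (D1 as "language modeling approach", D2 as D1 repeated),
and the topic counts passed as tau are assigned by hand rather than measured.

```python
from collections import Counter
from math import isclose

def is_k_verbose(tf1, tf2, tau1, tau2, k):
    """K-verbosity: tf2(w) = k * tf1(w) for all terms w, and tau(D1) = tau(D2)."""
    terms = set(tf1) | set(tf2)
    return isclose(tau1, tau2) and all(tf2[w] == k * tf1[w] for w in terms)

def is_n_topical(tf1, tf2, tau1, tau2, n):
    """N-topicality: tau(D2) = n * tau(D1) and tf2(w)/l2 = (tf1(w)/l1)/n for all w in D1."""
    l1, l2 = sum(tf1.values()), sum(tf2.values())
    same_ratio = all(isclose(tf2[w] / l2, tf1[w] / l1 / n) for w in tf1)
    return isclose(tau2, n * tau1) and same_ratio

# Assumed sample documents; a Counter returns 0 for terms that do not occur.
tf_d1 = Counter("language modeling approach".split())
tf_d2 = Counter(("language modeling approach " * 2).split())
tf_d3 = Counter("information retrieval model language modeling approach".split())

# Hand-assigned topic counts (an assumption): one topic in D1 and D2, two in D3.
print(is_k_verbose(tf_d1, tf_d2, tau1=1, tau2=1, k=2))
print(is_n_topical(tf_d1, tf_d3, tau1=1, tau2=2, n=2))
```

With these inputs both predicates hold, matching the statement that D2 has 2-verbosity and
D3 has 2-topicality to D1.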

These constraints can be directly used to derive a new class of retrieval functions, as in
Fang's exploration [3]. Fang originally formulated two constraints related to term
frequency, LNC1 and LNC2 [3]. Of these, LNC2 is closely related to VNC, where VNC is the
more specific constraint: VNC entails LNC2, but not vice versa. TNC is a new constraint
that is not connected to any of Fang's constraints. Note that our exploration of a
retrieval function differs from Fang's. We focus on only the few constraints related to
our issue, without identifying all constraints. Then, we select one of the previous
well-performing retrieval models as the backbone model, and modify it to better satisfy
these few constraints without losing the elegance and principles of the original model. In
this regard, our exploration method is a partially-axiomatic approach: 1) it uses partial
constraints rather than full constraints, and 2) it uses the restricted functional space
that the backbone retrieval model allows, rather than the full functional space. In
contrast, Fang's approach is a fully-axiomatic approach [3, 4]. In Fang's approach, the
full set of constraints is identified, not only the focused ones, and a new class of
retrieval functions is explored in a separate functional space that is not related to
previous retrieval models. However, a fully-axiomatic approach such as Fang's requires
unprincipled heuristics that are not derived from a well-designed retrieval model. A
partially-axiomatic approach does not need to discard a well-founded retrieval model such
as the language modeling approach, which enables us to pursue a more elaborate retrieval
model without losing its mathematical elegance and principles.


2.2 Analysis of Language Modeling Approaches 


We selected the language modeling approach as the backbone retrieval model [5]. Our goal
is to modify the language modeling approach so that it better satisfies the two proposed
constraints, VNC and TNC. We investigate two popular smoothing methods: Jelinek-Mercer
smoothing (JM) and Dirichlet-prior smoothing (Dir) [6]. Before modifying them, we discuss
in this subsection whether each smoothing method satisfies VNC and TNC. The notation used
in this paper is summarized as follows:



    Symbol     Meaning
    Q          A given query
    tf_D(w)    Term frequency of w in document D
    l_D        Length of document D
    tf_C(w)    Term frequency of w in the collection
    l_C        Total term frequency of the collection
    θ_D        Smoothed document language model of D
    θ̂_D        Unsmoothed document language model of D (MLE)
    θ_C        Collection language model (MLE)


Analysis of Jelinek-Mercer Smoothing. In JM (Jelinek-Mercer smoothing), the smoothed
document model is obtained by interpolating the MLE (Maximum Likelihood Estimate) of the
document model with the collection model as follows [6]:

$$P(w \mid \theta_D) = (1-\lambda)\,P(w \mid \hat{\theta}_D) + \lambda\,P(w \mid \theta_C) \qquad (1)$$

where λ is a smoothing parameter. With JM, score(Q,D), the similarity score of document D
for query Q, can be written using only query-matching terms as follows:

$$score(Q,D) = \sum_{w \in Q} \log\left( \frac{(1-\lambda)\,P(w \mid \hat{\theta}_D)}{\lambda\,P(w \mid \theta_C)} + 1 \right) = \sum_{w \in Q} \log\left( \frac{1-\lambda}{\lambda} \cdot \frac{tf_D(w)}{l_D} \cdot \frac{l_C}{tf_C(w)} + 1 \right) \qquad (2)$$


Our analysis of whether JM satisfies VNC and TNC is as follows:

1. JM satisfies VNC: Suppose that D2 is K-verbose to D1. Then the MLEs of the two document
models are the same, resulting in the same scores.

2. JM does not satisfy TNC: Generally, JM prefers normal documents to multi-topical
documents, regardless of our definition of the topicality measurement τ. The proof is
omitted.
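
As an illustration of this analysis, the following minimal sketch evaluates Eq. (2) on the
three sample documents. The collection statistics (coll_tf, coll_len), the value of lam,
and the texts of D1 and D2 are hypothetical values chosen only for the example; the
function name jm_score is likewise an assumption.

```python
from collections import Counter
from math import isclose, log

def jm_score(query, doc, coll_tf, coll_len, lam=0.5):
    """Eq. (2): sum over query terms of log((1-lam)/lam * tf_D(w)/l_D * l_C/tf_C(w) + 1)."""
    tf, doc_len = Counter(doc), len(doc)
    return sum(
        log((1 - lam) / lam * (tf[w] / doc_len) * (coll_len / coll_tf[w]) + 1)
        for w in query
    )

# Hypothetical collection statistics (not from the paper).
coll_tf = {"language": 50, "modeling": 20, "approach": 30}
coll_len = 100_000

d1 = "language modeling approach".split()          # assumed sample text
d2 = d1 * 2                                        # 2-verbose to D1
d3 = "information retrieval model language modeling approach".split()
query = "language modeling approach".split()

s1, s2, s3 = (jm_score(query, d, coll_tf, coll_len) for d in (d1, d2, d3))
print("VNC holds for D1, D2:", isclose(s1, s2))
print("D3 scored below D1:", s3 < s1)
```

Because JM depends on the document only through tf_D(w)/l_D, the verbose document D2
receives the same score as D1, while the multi-topical D3 is scored lower in this example.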


Analysis of Dirichlet-Prior Smoothing. In Dir (Dirichlet-prior smoothing), the smoothed
document model is estimated as the posterior model when μP(w|θ_C) is taken as the prior
for term w, as follows [6]:

$$P(w \mid \theta_D) = \frac{tf_D(w) + \mu\,P(w \mid \theta_C)}{l_D + \mu} \qquad (3)$$

This equation can be rewritten as

$$P(w \mid \theta_D) = \frac{l_D}{l_D + \mu}\,P(w \mid \hat{\theta}_D) + \frac{\mu}{l_D + \mu}\,P(w \mid \theta_C) \qquad (4)$$

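If we set λ_D = μ/(l_D + μ), then Dir is equivalent to JM-style smoothing with a
document-specific smoothing parameter λ_D. score(Q,D) based on Dir is formulated as
follows:

$$score(Q,D) = \sum_{w \in Q} \log\left( \frac{1-\lambda_D}{\lambda_D} \cdot \frac{tf_D(w)}{l_D} \cdot \frac{l_C}{tf_C(w)} + 1 \right) + |Q| \log \lambda_D$$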

The analysis of whether Dir satisfies VNC and TNC is somewhat more complicated because of
its document-specific smoothing parameter. Nevertheless, it can be shown that Dir
satisfies neither VNC nor TNC. The analysis results are as follows:

1. Dir does not satisfy VNC: Generally, Dir makes inconsistent preferences according to
whether or not a query term is topical. For a topical query term, Dir assigns a higher
score to verbose documents than to normal documents. For a non-topical query term, Dir
assigns a lower score to verbose documents than to normal documents. The detailed proof is
omitted.

2. Dir does not satisfy TNC: The detailed proof is omitted.
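
The inconsistent behavior described for VNC can be seen in a small numerical sketch of the
Dirichlet-smoothed probability. All numbers below (term frequencies, document lengths,
collection probabilities, and mu) are hypothetical; a term is treated as topical when its
relative frequency in the document exceeds its collection probability.

```python
def dir_prob(tf, doc_len, p_coll, mu=1000.0):
    """Dirichlet-smoothed probability (tf_D(w) + mu * P(w|C)) / (l_D + mu)."""
    return (tf + mu * p_coll) / (doc_len + mu)

# Topical term: relative frequency 0.10 in the document, 0.001 in the collection.
topical_normal = dir_prob(tf=10, doc_len=100, p_coll=0.001)
topical_verbose = dir_prob(tf=20, doc_len=200, p_coll=0.001)   # 2-verbose copy

# Non-topical term: relative frequency 0.01 in the document, 0.05 in the collection.
common_normal = dir_prob(tf=1, doc_len=100, p_coll=0.05)
common_verbose = dir_prob(tf=2, doc_len=200, p_coll=0.05)      # 2-verbose copy

print("topical:     normal", topical_normal, "verbose", topical_verbose)
print("non-topical: normal", common_normal, "verbose", common_verbose)
```

In this example the verbose copy receives a higher probability for the topical term and a
lower one for the non-topical term, because the weight on the collection model,
mu / (l_D + mu), shrinks as the document grows.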

3 Modification of Previous Retrieval Models 

In the previous section, we showed that the two smoothing methods do not satisfy the two
constraints well. In this section, we introduce measurements of the number of topics, and
modify the previous retrieval models so that they better satisfy VNC and TNC.


3.1 Measurement of The Number of Topics 

To determine which measurement τ(D) is suitable for calculating the number of topics in
document D, we propose two simple measurements for τ(D): the first is vocabulary size, and
the second is information quantity.

Vocabulary Size: Generally, the more terms a document has, the more topics it has. Based
on this idea, we can use the vocabulary size v(D), the number of unique terms in a given
document, as a measurement of the number of topics.

Information Quantity: Even though the vocabulary size is simple and reasonable, it cannot
discriminate the main topical terms from casually occurring terms. When the vocabulary
size is used, the number of topics may be unreasonably increased by casually occurring
terms. As an alternative measurement, we consider an entropy-driven value. Recall that
entropy measures the uncertainty of a generated sample. Entropy has the following positive
properties for resolving the limitation of the vocabulary size. 1) As the number of
possible events increases, entropy becomes larger. Here, events correspond to terms, hence
the more terms there are, the larger the entropy is likely to be. Thus, when a document
has more topics, the content of the document can be described in more varied ways,
resulting in a larger entropy value. 2) The term generative probability of a document is
used as the weight for calculating the entropy value. The larger the probability of a
term, the more it contributes to the final entropy value. This property allows us to
differentiate the effects of main topical terms and casually occurring terms.

The information quantity e(D) is defined as the exponential of the entropy of a document
as follows:

$$\tau(D) = e(D) = \exp\left( -\sum_{w} P(w \mid \hat{\theta}_D) \log P(w \mid \hat{\theta}_D) \right)$$

Some Useful Definitions: We define some useful notation. Let us define the normalized
measurement of the number of topics, τ'(D), and the informative verbosity, ω(D), as
follows:

$$\tau'(D) = \tau(D)/\bar{\tau}, \qquad \omega(D) = l_D/\tau(D)$$

where τ̄ is the mean of τ(D) over all documents in a given test collection. Note that the
informative verbosity indicates the average term frequency per unit of information.
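
These measurements are straightforward to compute from term counts. The sketch below is
illustrative only: the function names are hypothetical, documents are whitespace-tokenized
without stemming or stopword removal, the texts of D1 and D2 are assumed, and the "test
collection" used for the mean consists of just the three sample documents.

```python
from collections import Counter
from math import exp, log

def vocabulary_size(doc):
    """v(D): number of unique terms in the document."""
    return len(set(doc))

def information_quantity(doc):
    """e(D): exponential of the entropy of the unsmoothed (MLE) document model."""
    tf, doc_len = Counter(doc), len(doc)
    entropy = -sum((c / doc_len) * log(c / doc_len) for c in tf.values())
    return exp(entropy)

def informative_verbosity(doc):
    """omega(D) = l_D / tau(D), using e(D) as tau(D)."""
    return len(doc) / information_quantity(doc)

d1 = "language modeling approach".split()   # assumed sample text
d2 = d1 * 2
d3 = "information retrieval model language modeling approach".split()
docs = {"D1": d1, "D2": d2, "D3": d3}

tau = {name: information_quantity(doc) for name, doc in docs.items()}
tau_mean = sum(tau.values()) / len(tau)      # mean over the toy collection only
for name, doc in docs.items():
    print(name, vocabulary_size(doc), round(tau[name], 3),
          round(tau[name] / tau_mean, 3), round(informative_verbosity(doc), 3))
```

For a document whose terms all occur equally often, e(D) equals the vocabulary size, so
here e(D) is about 3 for D1 and D2 and about 6 for D3; the informative verbosity is about
2 for the verbose D2 and about 1 for the other two.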


3.2 Modification of JM 

First Modification of JM. Since JM exactly satisfies VNC, we modify JM to additionally
support TNC. The core idea of the modification of JM smoothing is a pseudo document. The
pseudo document consists mainly of the parts relevant to a query, and is constructed by
separating the relevant parts from the non-relevant parts. The score of a document is then
calculated using the pseudo document model instead of the original document model.

Thus, the pseudo document gives us a dynamic view of document representation, in which a
document changes dynamically according to the query. Note that the pseudo document is an
imaginary concept, which is not actually constructed at retrieval time. All we require is
the generative probabilities of query terms under the pseudo document model.

To estimate the probabilities of query terms in a pseudo document, we simplify the
estimation problem by using the probabilities in the original document. In other words,
for terms in the pseudo document having non-zero probabilities, their probabilities are
assumed to be proportional to the probabilities of the terms in the original document. As
a result, the estimation problem is solved once we determine the length of the pseudo
document from the original length l_D.

Intuitively, the more topics a document has, the shorter the pseudo document will be. This
intuition makes the length of the pseudo document proportional to l_D/τ(D). Thus, if
θ_pseudo(D) is the language model of the pseudo document, then the probability under the
pseudo document model is

$$P(w \mid \theta_{pseudo(D)}) \propto \frac{tf_D(w)}{l_D/\tau(D)} = \frac{tf_D(w) \cdot \tau(D)}{l_D}$$

It is rewritten using τ'(D) instead of τ(D), and a constant K, as follows:

$$P(w \mid \theta_{pseudo(D)}) = K \cdot tf_D(w) \cdot \tau'(D) / l_D$$

If we assume that the constant K is independent of any document and query, then K is not a
tuning parameter, since it can be absorbed into the smoothing parameter λ.

Let us derive a modified JM by replacing the original document model in Eq. (2) with this
pseudo document model. Then score(Q,D) is reformulated as follows:

$$score(Q,D) = \sum_{w \in Q} \log\left( \frac{1-\lambda_0}{\lambda_0} \cdot \frac{K \cdot \tau'(D) \cdot tf_D(w)}{l_D} \cdot \frac{l_C}{tf_C(w)} + 1 \right) \qquad (5)$$

where λ_0 is another smoothing parameter for the pseudo document model. Since K is
independent of any document and query, we can select λ such that (1 - λ_0)K : λ_0 equals
(1 - λ) : λ, in order to eliminate the constant K. Then Eq. (5) is rewritten as

$$score(Q,D) = \sum_{w \in Q} \log\left( \frac{1-\lambda}{\lambda} \cdot \frac{\tau'(D) \cdot tf_D(w)}{l_D} \cdot \frac{l_C}{tf_C(w)} + 1 \right) \qquad (6)$$

Using the MLE of the original document model, P(w|θ̂_D), Eq. (6) is rewritten as

$$score(Q,D) = \sum_{w \in Q} \log\left( \frac{(1-\lambda)\,\tau'(D)\,P(w \mid \hat{\theta}_D)}{\lambda\,P(w \mid \theta_C)} + 1 \right) \qquad (7)$$

Eq. (7) is the final modified JM, which we call JMV. JMV satisfies both VNC and TNC.


1. JMV satisfies VNC: Let D2 be K-verbose to D1. Then τ(D1) = τ(D2) and P(w|θ̂_D1) =
P(w|θ̂_D2). Thus, score(Q,D1) = score(Q,D2).

2. JMV satisfies TNC: Let D3 be N-topical to D1. Then τ(D3) = N · τ(D1) and P(w|θ̂_D1) =
N · P(w|θ̂_D3). It follows that τ(D3) · P(w|θ̂_D3) = τ(D1) · P(w|θ̂_D1). Therefore,
score(Q,D1) = score(Q,D3).
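
A small numerical check of these two properties is sketched below, using the information
quantity as τ(D). The collection statistics, the value of lam, the texts of D1 and D2, and
the mean τ taken over only the three sample documents are all assumptions made for the
example; the function names are hypothetical.

```python
from collections import Counter
from math import exp, isclose, log

def info_quantity(doc):
    """e(D): exponential of the entropy of the MLE document model."""
    tf, doc_len = Counter(doc), len(doc)
    return exp(-sum((c / doc_len) * log(c / doc_len) for c in tf.values()))

def jmv_score(query, doc, coll_tf, coll_len, tau_mean, lam=0.5):
    """JMV: sum of log((1-lam) * tau'(D) * P(w|D) / (lam * P(w|C)) + 1) over query terms."""
    tf, doc_len = Counter(doc), len(doc)
    tau_norm = info_quantity(doc) / tau_mean
    return sum(
        log((1 - lam) * tau_norm * (tf[w] / doc_len) / (lam * coll_tf[w] / coll_len) + 1)
        for w in query
    )

# Hypothetical collection statistics (not from the paper).
coll_tf = {"language": 50, "modeling": 20, "approach": 30}
coll_len = 100_000

d1 = "language modeling approach".split()   # assumed sample text
d2 = d1 * 2
d3 = "information retrieval model language modeling approach".split()
query = "language modeling approach".split()
tau_mean = sum(info_quantity(d) for d in (d1, d2, d3)) / 3

s1, s2, s3 = (jmv_score(query, d, coll_tf, coll_len, tau_mean) for d in (d1, d2, d3))
print("VNC (D1 vs D2):", isclose(s1, s2))
print("TNC (D1 vs D3):", isclose(s1, s3))
```

The product τ(D) · P(w|D) is the same for all three documents in this example, so the
three JMV scores coincide up to floating-point rounding.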


Second Modification of JM. In our preliminary experiments, we found that JMV performs well
for keyword queries (i.e. title queries), but is not reliable for verbose queries (i.e.
description queries), showing serious sensitivity to the smoothing parameter λ. To discuss
the reason for this result, we focus on the main differences between keyword queries and
verbose queries. First, verbose queries contain common terms. Unlike topical terms, common
terms can be shared by all topics. A common term always behaves verbosely, in both verbose
documents and multi-topical documents. Thus, the previous TF normalization would prefer
multi-topical documents for queries including common terms. Second, verbose queries often
contain noise terms such as "relevant", "find" and "documents". When a document has more
topics, the chance that such noise terms occur in it increases. When our previous TF
normalization is applied, the effect of noise terms becomes very serious, because the
normalized term frequency is further multiplied by the number of topics. Thus, the
previous TF normalization would increase the scores of multi-topical documents for noisy
queries. These two differences may be the reason why Singhal et al. penalized
multi-topical documents as well as verbose documents [1]. However, as we have already
discussed, their approach is not acceptable for topical terms.

To handle the problems of verbose queries, our TF normalization should be restricted to
document-specific terms only, and not applied to noise terms or common terms. The more
topical a query term is in a given document, the more TF normalization we want to perform,
and vice versa. To this end, we define s(w,D) as the term specificity of w in document D.
For s(w,D), this paper uses the probabilistic metric P(D|w), which is defined as follows:

$$s(w,D) = P(D \mid w) = \frac{\lambda_s\,P(w \mid \hat{\theta}_D)}{\lambda_s\,P(w \mid \hat{\theta}_D) + (1-\lambda_s)\,P(w \mid \theta_C)}$$

where λ_s is an additional smoothing parameter, with 0.25 as the default value. Using the
term specificity s(w,D), we modify the pseudo document model as follows:

$$P(w \mid \theta_{pseudo(D)}) = K \cdot tf_D(w) \cdot \tau'(D)^{P(D \mid w)} / l_D \qquad (8)$$

Since P(D|w) is between 0 and 1, the normalization is fully reflected when P(D|w) is 1,
and is weakened as P(D|w) approaches 0. One problem arises when τ'(D) is smaller than 1.
In this case, as P(D|w) becomes larger, the effect of normalization becomes weaker. To
resolve this problem, we considered an exceptional TF normalization that makes the
normalization proportional to P(D|w) even when τ'(D) is smaller than 1. In preliminary
experiments, we found that the final retrieval performance is almost unchanged even after
the exceptional TF normalization is applied. Thus, we select Eq. (8) for the second
modification. We call it JMV2.
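
The effect of the term specificity on the normalization factor can be sketched as follows.
The function names and all probabilities are hypothetical; the second function only
computes the factor τ'(D) raised to the power P(D|w) that appears in Eq. (8), not a
complete retrieval score.

```python
def specificity(p_doc, p_coll, lam_s=0.25):
    """s(w, D) = P(D|w) = lam_s*P(w|D) / (lam_s*P(w|D) + (1 - lam_s)*P(w|C))."""
    weighted_doc = lam_s * p_doc
    return weighted_doc / (weighted_doc + (1 - lam_s) * p_coll)

def jmv2_factor(tau_norm, p_doc, p_coll, lam_s=0.25):
    """Topic-normalization factor tau'(D) ** P(D|w) applied to tf_D(w) / l_D."""
    return tau_norm ** specificity(p_doc, p_coll, lam_s)

# A document-specific term: far more frequent in the document than in the collection.
topical = jmv2_factor(tau_norm=2.0, p_doc=0.2, p_coll=0.001)

# A common or noise term: as frequent in the document as in the collection.
common = jmv2_factor(tau_norm=2.0, p_doc=0.01, p_coll=0.01)

print("specificity (topical):", specificity(0.2, 0.001))
print("specificity (common): ", specificity(0.01, 0.01))
print("factors:", topical, common)
```

For a document with twice the average number of topics, the topical term receives a factor
close to the full value of 2, whereas the common term receives a much smaller boost,
because its specificity equals lam_s when P(w|D) and P(w|C) coincide.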

4 Modification of Dir 

Our goal in modifying Dir is to satisfy VNC. We introduce the concept of a pseudo document
model to modify Dir. Unlike the pseudo document for the JM modification, which consists of
query-relevant parts only, the pseudo document for the Dir modification consists of all
topics in the original document, but has a length different from the original length. Note
that changing only the length yields a different model, since the smoothed model
P(w|θ_D) depends on the document length. In fact, this length dependence was the main
reason why Dir does not satisfy VNC.

We assume that the pseudo document model is proportional to the original MLE document
model. In addition, we set the length of the pseudo document to τ(D). Recall that the
informative verbosity ω(D) is defined as l_D/τ(D). That is, the pseudo document with
length τ(D) compacts the original document with length l_D by a factor of ω(D). Therefore,
each term w of document D has the following term frequency in the pseudo document:

$$tf_{pseudo(D)}(w) = tf_D(w)/\omega(D) \qquad (9)$$

As a result, the pseudo document model becomes a length-independent model, even though the
MLE of the pseudo document model is the same as that of the original document model. Using
the pseudo document model, Dir produces the following smoothed model:

$$P(w \mid \theta_{pseudo(D)}) = \frac{tf_{pseudo(D)}(w) + \mu\,P(w \mid \theta_C)}{\tau(D) + \mu} \qquad (10)$$

By substituting Eq. (9) into Eq. (10), Eq. (10) becomes

$$P(w \mid \theta_{pseudo(D)}) = \frac{\tau(D)}{\tau(D) + \mu}\,P(w \mid \hat{\theta}_D) + \frac{\mu}{\tau(D) + \mu}\,P(w \mid \theta_C) \qquad (11)$$

This final modified model can be viewed as JM-style smoothing with a document-specific
smoothing parameter λ_D = μ/(τ(D) + μ), which no longer depends on the length. We call
this modification DirV. We can easily prove that DirV additionally satisfies VNC.



1. DirV satisfies VNC: Let D2 be K-verbose to D1. Then the two MLE models are equal (i.e.
P(w|θ̂_D1) = P(w|θ̂_D2)), and λ_D1 equals λ_D2 since τ(D1) and τ(D2) are the same. Thus,
DirV gives the same score to D1 and D2.

2. DirV does not satisfy TNC: For DirV, we make no special provision for supporting TNC.
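
The difference between Dir and DirV on a verbose pair can be illustrated with the
following sketch, which uses the information quantity as τ(D). The collection probability
p_coll, the value of mu, and the texts of D1 and D2 are hypothetical, as are the function
names.

```python
from collections import Counter
from math import exp, isclose, log

def info_quantity(doc):
    """e(D): exponential of the entropy of the MLE document model."""
    tf, doc_len = Counter(doc), len(doc)
    return exp(-sum((c / doc_len) * log(c / doc_len) for c in tf.values()))

def dir_prob(term, doc, p_coll, mu=1000.0):
    """Original Dir: (tf_D(w) + mu * P(w|C)) / (l_D + mu)."""
    return (Counter(doc)[term] + mu * p_coll) / (len(doc) + mu)

def dirv_prob(term, doc, p_coll, mu=1000.0):
    """DirV: the length l_D is replaced by tau(D) and tf_D(w) by tau(D) * P(w|D)."""
    tau = info_quantity(doc)
    p_mle = Counter(doc)[term] / len(doc)
    return (tau * p_mle + mu * p_coll) / (tau + mu)

d1 = "language modeling approach".split()   # assumed sample text
d2 = d1 * 2                                 # 2-verbose to D1
p_coll = 0.0005                             # hypothetical collection probability

print("Dir :", dir_prob("language", d1, p_coll), dir_prob("language", d2, p_coll))
print("DirV:", dirv_prob("language", d1, p_coll), dirv_prob("language", d2, p_coll))
```

Dir assigns different probabilities to the two documents because l_D differs, while DirV
assigns the same probability (up to rounding) because both τ(D) and the MLE model are
unchanged by repetition.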

5 Experimentation 

5.1 Experimental Setting 

For evaluation, we used five TREC test collections. Index terms were extracted with the
standard method: we first separated words at space characters, eliminated stopwords, and
then applied Porter's stemming. Table 1 summarizes the basic information of each test
collection. In the columns, #Q, Topics, #R, #Doc, avglen, and #Term are the number of
topics, the corresponding query topic IDs, the number of relevant documents, the number of
documents, the average length of documents, and the number of terms, respectively.


Table 1. Collection summaries

    Collection  #Q  Topics   #R     #Doc       avglen  #Term
    TREC7       50  350-400  4,674  528,155    154.6   970,977
    TREC8       50  401-450  4,728  (same document collection as TREC7)
    WT2G        50  401-450  2,279  247,491    254.99  2,585,383
    TREC9       50  451-500  2,617  1,692,096  165.16  13,018,003
    TREC10      50  501-550  3,363  (same document collection as TREC9)


Following Zhai's work [6], we used the following three types of queries:

1) Short Keyword (SK): Using only the title of the topic description.

2) Short Verbose (SV): Using only the description field (usually one sentence).

3) Long Verbose (LV): Using the title, description and narrative fields (more than 50
words on average).

For retrieval evaluation, we used MAP (Mean Average Precision), Pr@5 (Precision at 5
documents), and Pr@10 (Precision at 10 documents).
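
For reference, the evaluation measures can be computed from a ranked list and a set of
relevant documents as in the following minimal sketch. It uses the standard
(non-interpolated) definitions; the function names and the toy ranking are hypothetical
and are not taken from the experiments.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each retrieved relevant document,
    divided by the total number of relevant documents."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy example: relevant documents found at ranks 2 and 4; one is never retrieved.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}
print(precision_at_k(ranked, relevant, 5))
print(average_precision(ranked, relevant))
```

In the toy example, two of the top five documents are relevant, and the average precision
is (1/2 + 2/4) / 3 because the third relevant document is not retrieved.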

5.2 Experimental Results 

Table 2 shows the best performances (MAP, Pr@5, Pr@10) of DirV and JMV2, compared with
Dir. For the topic measurement τ(D), we selected the information quantity e(D), since JMV2
and DirV perform better with the information quantity than with the vocabulary size. We
used the MLE (Maximum Likelihood Estimate) P(w|θ̂_D), without any smoothing, to calculate
the information quantity. We selected Dir as the baseline due to its superiority over JM
on all test collections. To obtain the best performance of each run, we searched 20
different values between 0.01 and 0.99 for λ, and 22 values between 100 and 30,000 for μ.

Table 2. Performances of Dir, DirV and JMV2 (MAP, Pr@5, Pr@10). Bold-faced numbers
indicate runs showing significant improvement over Dir.

    MAP
                Dir                        DirV                       JMV2
    Collection  SK       SV       LV       SK       SV       LV       SK       SV       LV
    TREC7       0.1786   0.1790   0.2209   0.1835   0.1967‡  0.2348‡  0.1825   0.1926‡  0.2250
    TREC8       0.2481   0.2294   0.2598   0.2492   0.2393‡  0.2621‡  0.2505‡  0.2354‡  0.2500
    WT2G        0.3101   0.2854   0.2863   0.3125   0.3103‡  0.3267‡  0.3278‡  0.3112‡  0.3263‡
    TREC9       0.2038   0.1990   0.2468   0.2040   0.2336‡  0.2581‡  0.2068   0.2245‡  0.2494
    TREC10      0.1950   0.1865   0.2347   0.2049†  0.2248   0.2640   0.2091   0.2133‡  0.2555

    Pr@5
                Dir                        DirV                       JMV2
    Collection  SK       SV       LV       SK       SV       LV       SK       SV       LV
    TREC7       0.4400   0.4280   0.5240   0.4560   0.4840‡  0.5680‡  0.4680   0.4920‡  0.5800‡
    TREC8       0.4920   0.4320   0.5120   0.5120   0.5040‡  0.5360   0.5240‡  0.4880   0.5280
    WT2G        0.5160   0.5120   0.5280   0.5360   0.5520   0.5720‡  0.5400   0.5560   0.5920‡
    TREC9       0.3000   0.3480   0.4160   0.3320   0.4240‡  0.4320   0.3440   0.3720   0.3880
    TREC10      0.3520   0.4040   0.4720   0.3840   0.4520   0.4920   0.3800   0.4200   0.4880

    Pr@10
                Dir                        DirV                       JMV2
    Collection  SK       SV       LV       SK       SV       LV       SK       SV       LV
    TREC7       0.3980   0.4120   0.4420   0.4180†  0.4420   0.4720‡  0.4100   0.4440   0.4800‡
    TREC8       0.4460   0.4120   0.4660   0.4740†  0.4380   0.4780   0.4700‡  0.4400   0.4480
    WT2G        0.4660   0.4220   0.4240   0.4840   0.4840‡  0.4800‡  0.4920   0.4900‡  0.4820‡
    TREC9       0.2560   0.2860   0.3160   0.2780   0.3260‡  0.3540‡  0.2780   0.3160‡  0.3220
    TREC10      0.3060   0.3500   0.4040   0.3300   0.3820   0.4340   0.3300   0.3700   0.4340

To check whether the proposed methods (DirV and JMV2) significantly improve over the
baseline, we performed the Wilcoxon signed-rank test at the 95% and 99% confidence levels.
We attached † and ‡ to the performance number in each cell of the table when the test
passes at the 95% and 99% confidence level, respectively. The results are summarized as
follows:


1. DirV significantly improves the MAP of Dir for verbose queries (SV and LV). As an
exception, TREC10 did not show a significant improvement for verbose queries.

2. DirV does not significantly improve the MAP of Dir for keyword queries (SK), but it
improves the precisions (Pr@5 or Pr@10). In particular, on TREC7 and TREC8, Pr@10 is
significantly improved over Dir. Although the other test collections do not show a
statistically significant improvement, the numerical increases are large.

3. DirV or JMV2 shows an improvement on specific test collections even for keyword
queries. For DirV, TREC10 is such a collection, showing a significant improvement in MAP.
For JMV2, WT2G is such a collection.

4. Overall, DirV is slightly better than JMV2 on most test collections. WT2G is the
exception, where JMV2 significantly improves over DirV.




































































































































































































6 Conclusion 


This paper introduced a new issue for TF normalization by considering two different types
of long documents: verbose documents and multi-topical documents. We proposed a novel TF
normalization method that uses a partially-axiomatic approach. To this end, we formulated
two desirable constraints that a retrieval function should satisfy, and showed that
previous language modeling approaches do not satisfy these constraints well. Then, we
derived novel smoothing methods for language modeling approaches without losing their
basic principles, and showed that the proposed methods satisfy these constraints more
effectively. Experimental results on five standard TREC collections show that the proposed
methods are better than previous smoothing methods, especially for verbose queries. JMV2
significantly improved over JM for all types of queries, and DirV eliminated a limitation
of Dir by providing robust performance for verbose queries, as well as improving the
precisions (Pr@5 or Pr@10) for keyword queries. This is comparable to recent results using
more complicated query-specific smoothing based on the Poisson language model [7].

To handle long documents, passage-based retrieval could be applied [8]. However,
passage-based retrieval reduces efficiency, since it requires additional processing such
as indexing of position information and pre-segmenting of individual passages, and, more
importantly, incurs additional overhead at online retrieval time. In contrast to
complicated methods such as passage retrieval, this paper handles multi-topical documents
in a simple manner by investigating a more accurate TF normalization, without additional
efficiency cost.